Aino Peura, 28.11.2021. These are the week 4 exercises.
#install.packages("MASS")
#install.packages("corrplot")
#install.packages("Hmisc")
#install.packages("GGally")
library(MASS)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.3
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.6.3
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Load the Boston data from MASS
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston data frame describes housing values in the suburbs of Boston. It has 506 observations and 14 variables. The following descriptions are from the RDocumentation of the MASS package:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
The first task is to give a graphical overview of the data and to show summaries of the variables and the relationships between them. This will be done in three parts. First, graphical overviews of the variables:
#With the following code, density estimates are computed (density()) and plotted to show the distribution of each variable
par(mfrow=c(2,4))
for(i in 1:14){
nimi<- colnames(Boston)[i]
data<- density(Boston[,i])
plot(data, main = nimi)
}
From a quick (and admittedly rough) visual interpretation of the density plots, we can group the variables as follows (a numeric skewness check is sketched after the list):
- variables that have peak on the small values:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- dis: weighted mean of distances to five Boston employment centres.
- variables that have peak on high values:
- age: proportion of owner-occupied units built prior to 1940.
- ptratio: pupil-teacher ratio by town.
- black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
- variables that are normally distributed:
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- variables that have two peaks:
- indus: proportion of non-retail business acres per town.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
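To back up these visual impressions numerically, we could compute a rough sample skewness for each variable (a quick sketch, no extra packages needed; skews is just an illustrative name; large positive values indicate a peak at small values with a long right tail):
#Rough sample skewness for each column (positive = right-skewed)
skews<- sapply(Boston, function(x) mean((x - mean(x))^3) / sd(x)^3)
round(sort(skews), 2)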
Then, let's look at the summaries of the data:
for(i in 1:14){
nimi<- colnames(Boston)[i]
data<- summary(Boston[,i])
print(nimi)
print(data)
}
## [1] "crim"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
## [1] "zn"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 11.36 12.50 100.00
## [1] "indus"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.46 5.19 9.69 11.14 18.10 27.74
## [1] "chas"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06917 0.00000 1.00000
## [1] "nox"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3850 0.4490 0.5380 0.5547 0.6240 0.8710
## [1] "rm"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.561 5.886 6.208 6.285 6.623 8.780
## [1] "age"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.90 45.02 77.50 68.57 94.08 100.00
## [1] "dis"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.130 2.100 3.207 3.795 5.188 12.127
## [1] "rad"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 9.549 24.000 24.000
## [1] "tax"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
## [1] "ptratio"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
## [1] "black"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.32 375.38 391.44 356.67 396.23 396.90
## [1] "lstat"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.73 6.95 11.36 12.65 16.95 37.97
## [1] "medv"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
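Note that summary() also works column-wise on a whole data frame, so the same information is available in one call:
summary(Boston)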
And finally, the correlation plot. Hierarchical clustering is used to order the variables, via order = "hclust":
par(mfrow=c(1,2))
#Let's compute the correlation matrix
correlations<-cor(Boston)
#Let's test the significance of each correlation
testRes<- cor.mtest(Boston)
#Let's plot it; cells left blank are insignificant correlations
corrplot(correlations, p.mat =testRes$p, insig = "blank",method = "number", order = "hclust", addrect = 2, number.cex = 0.75)
From the correlation plot we can see that (considering only significant correlations):
- the variables ptratio, lstat, age, indus, nox, crim, rad and tax seem to correlate positively with each other
- the variables dis, medv and rm seem to correlate positively with chas, black and zn (and obviously with themselves)
- chas only correlates positively with rm, medv and dis, and negatively with ptratio
- the strongest positive correlation is between rad and tax (checked programmatically below)
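As a quick check of that last claim, we can locate the strongest off-diagonal correlation programmatically (a sketch; cors is a throwaway copy of the correlations matrix computed above):
cors<- correlations
diag(cors)<- 0 #ignore the trivial self-correlations
which(abs(cors) == max(abs(cors)), arr.ind = TRUE) #should point at rad and tax
max(cors)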
So, the next task is to scale the Boston dataset.
BostonScaled<- as.data.frame(scale(Boston))
#Let's print summaries of the scaled Boston dataset and, under each column, the unscaled data, so we can see what changed:
for(i in 1:14){
nimi<- colnames(BostonScaled)[i]
data1<- summary(BostonScaled[,i])
data2<-summary(Boston[,i])
print(nimi)
print(data1)
print(data2)
}
## [1] "crim"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
## [1] "zn"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.48724 -0.48724 -0.48724 0.00000 0.04872 3.80047
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 11.36 12.50 100.00
## [1] "indus"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5563 -0.8668 -0.2109 0.0000 1.0150 2.4202
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.46 5.19 9.69 11.14 18.10 27.74
## [1] "chas"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2723 -0.2723 -0.2723 0.0000 -0.2723 3.6648
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06917 0.00000 1.00000
## [1] "nox"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4644 -0.9121 -0.1441 0.0000 0.5981 2.7296
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3850 0.4490 0.5380 0.5547 0.6240 0.8710
## [1] "rm"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.8764 -0.5681 -0.1084 0.0000 0.4823 3.5515
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.561 5.886 6.208 6.285 6.623 8.780
## [1] "age"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.3331 -0.8366 0.3171 0.0000 0.9059 1.1164
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.90 45.02 77.50 68.57 94.08 100.00
## [1] "dis"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2658 -0.8049 -0.2790 0.0000 0.6617 3.9566
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.130 2.100 3.207 3.795 5.188 12.127
## [1] "rad"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9819 -0.6373 -0.5225 0.0000 1.6596 1.6596
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 9.549 24.000 24.000
## [1] "tax"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.3127 -0.7668 -0.4642 0.0000 1.5294 1.7964
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
## [1] "ptratio"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.7047 -0.4876 0.2746 0.0000 0.8058 1.6372
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
## [1] "black"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.9033 0.2049 0.3808 0.0000 0.4332 0.4406
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.32 375.38 391.44 356.67 396.23 396.90
## [1] "lstat"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5296 -0.7986 -0.1811 0.0000 0.6024 3.5453
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.73 6.95 11.36 12.65 16.95 37.97
## [1] "medv"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9063 -0.5989 -0.1449 0.0000 0.2683 2.9865
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
Okay, from the values above we can see that the scale() function scaled every variable to have mean 0, and each value now expresses how many standard deviations it lies from the variable's mean. These scaled values are called z-scores. We can verify this below.
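A quick sketch verifying for one column that scale() computes exactly \((x - \bar{x})/s\):
#The z-score computed by hand for crim should match the scaled column
all.equal(as.numeric(scale(Boston$crim)), (Boston$crim - mean(Boston$crim))/sd(Boston$crim))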
Then, the next task is to divide the scaled crime rate into quantiles and create a categorical variable called crime:
#Let's compute the quantiles
bins<- quantile(BostonScaled$crim)
#Let's create the new variable with this ifelse-chain.
#We start from the smallest bin; each no-branch contains the test for the next bin
BostonScaled$crime<- ifelse(BostonScaled$crim<bins[2], paste("[", round(bins[1], digits = 3), ",", round(bins[2], digits = 3), "]", sep = ""),
ifelse(BostonScaled$crim<bins[3], paste("(", round(bins[2], digits = 3), ",", round(bins[3], digits = 3), "]", sep = ""),
ifelse(BostonScaled$crim<bins[4], paste("(", round(bins[3], digits = 3), ",", round(bins[4], digits = 3), "]", sep = ""), paste("(",round(bins[4], digits = 3), ",", round(bins[5], digits = 3), "]", sep = "") )))
table(BostonScaled$crime)
##
## (-0.39,0.007] (-0.411,-0.39] (0.007,9.924] [-0.419,-0.411]
## 126 126 127 127
BostonScaled<- BostonScaled[,2:15]
#In DataCamp these were changed into the following classes: low, med_low, med_high, high; for clarity, we shall do the same
unique(BostonScaled$crime)
## [1] "[-0.419,-0.411]" "(-0.411,-0.39]" "(-0.39,0.007]" "(0.007,9.924]"
BostonScaled$crime<- ifelse(BostonScaled$crime=="[-0.419,-0.411]", "low", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(-0.411,-0.39]", "low_med", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(-0.39,0.007]", "high_med", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(0.007,9.924]", "high", BostonScaled$crime)
table(BostonScaled$crime)
##
## high high_med low low_med
## 127 126 127 126
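For reference, the same factor could be built in one step with cut(), which is the idiomatic approach (a sketch; crimScaled and crime2 are throwaway names; counts can differ by one at the bin boundaries, because the ifelse-chain above uses strict < while cut() uses right-closed intervals):
crimScaled<- as.numeric(scale(Boston$crim))
crime2<- cut(crimScaled, breaks = quantile(crimScaled), include.lowest = TRUE, labels = c("low", "low_med", "high_med", "high"))
table(crime2)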
All good! Now that we have created the variable crime, let's divide the data into train and test sets (80% train, 20% test):
#Let's create our train-data subset (note: sample() is random; see the reproducibility sketch below)
trainData<- BostonScaled[(sample(nrow(BostonScaled), size= 0.8*nrow(BostonScaled))),]
#And the leftover rows are the test data
testData<- subset(BostonScaled, is.element(rownames(BostonScaled), rownames(trainData))==F)
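Because sample() is random, the split changes between runs; a reproducible version would fix a seed first (a sketch; idx is just an illustrative name):
set.seed(123) #any fixed seed makes the split reproducible
idx<- sample(nrow(BostonScaled), size = 0.8*nrow(BostonScaled))
trainData<- BostonScaled[idx,]
testData<- BostonScaled[-idx,] #negative indexing picks the leftover rows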
So, the next task is to fit a linear discriminant analysis on the train dataset.
#LDA_analysis
ldadata<- lda(crime ~ ., data = trainData)
#Let's convert crime to a numeric vector, maintaining the original order of the classes
trainData$crime<- ifelse(trainData$crime=="low", 1, ifelse(trainData$crime=="low_med", 2, ifelse(trainData$crime=="high_med", 3, 4)))
#Then the arrows function from DataCamp
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#Let's define the classes variable
classes <- as.numeric(trainData$crime)
#Let's create the biplot with arrows; classes defines both the colors and the plotting symbols
plot(ldadata, dimen = 2, col = classes, pch = classes)
lda.arrows(ldadata, myscale = 1)
First, let's save the correct classes of the crime variable in the test set:
correct <- testData$crime
testData<- select(testData, -crime)
Then, let's predict the classes in the test data:
ldaprediction<- predict(ldadata, newdata = testData)
table(correct= correct, predicted = ldaprediction$class)
## predicted
## correct high high_med low low_med
## high 23 0 0 0
## high_med 1 17 0 10
## low 0 1 15 13
## low_med 0 2 7 13
So, for some reason the model confuses the classes low and low_med: many low observations are predicted as low_med. The other classes are predicted fairly accurately. From the confusion matrix above:
- 100% of high values are predicted correctly (23/23)
- 61% of high_med values are in the correct category (17/28)
- 59% of low_med values are in the correct category (13/22)
- BUT only 52% of low values are in the low category (15/29)
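These per-class rates can be computed directly from the confusion matrix (a sketch; tab is an illustrative name):
tab<- table(correct = correct, predicted = ldaprediction$class)
round(diag(tab)/rowSums(tab), 2) #share of each true class predicted correctly
sum(diag(tab))/sum(tab) #overall accuracy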
Next, let's create BostonScaled2, which is the original Boston dataset scaled again:
BostonScaled2<-scale(Boston)
Then, let's compute the Euclidean and Manhattan distances within the dataset:
eu<- dist(BostonScaled2, method = "euclidean")#This is the straight-line (hypotenuse) distance, so the shortest possible distance
man<- dist(BostonScaled2, method = "manhattan")#This is the sum of coordinate-wise absolute differences (like navigating between Manhattan's square blocks)
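As a concrete illustration of the difference between the two metrics, here are both distances between the first two observations (a sketch; x and y are throwaway names):
x<- BostonScaled2[1,]; y<- BostonScaled2[2,]
sqrt(sum((x - y)^2)) #euclidean: straight-line distance
sum(abs(x - y)) #manhattan: sum of coordinate-wise absolute differences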
Now, let's do k-means clustering:
#First, let's set the seed for reproducibility
set.seed(56)
#Then, let's compute the WCSS (within-cluster sum of squares) and find the optimal number of clusters (where the WCSS drops drastically)
#We set the maximal number of clusters to 10, for the sake of time and the computer
#We run kmeans for k = 1 to k = 10 and save each run's tot.withinss into WCSS
WCSS<- sapply(1:10, function(k){kmeans(BostonScaled2, k)$tot.withinss})
qplot(x=1:10, y=WCSS, geom="line")
#The sharpest drop in WCSS is around 2, so the optimal number of clusters is 2
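#Side note (an aside, not used above): kmeans() starts from random centers, so
#without set.seed() the WCSS curve could change between runs; the nstart argument,
#e.g. kmeans(BostonScaled2, k, nstart = 25), runs several random starts and keeps
#the best solution, which stabilizes the result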
BostonScaled2<- as.data.frame(BostonScaled2)
km<-kmeans(BostonScaled2, centers = 2)
BostonScaled2$cluster<- km$cluster
BostonScaled2$cluster<- as.factor(BostonScaled2$cluster)
ggpairs(BostonScaled2, upper = "blank", diag = "blank", mapping=ggplot2::aes(colour = cluster))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (message repeated for each histogram panel)
Interpretation: even though the panels are tiny, we can see the distribution of the colors. The two clusters separate the variable crim beautifully, and zn is also divided cleanly; the same holds for indus, dis and nox. The other variables do not seem to be separated by these two clusters. A cluster-wise summary sketch follows.
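This visual impression can be backed up with cluster-wise means of the scaled variables (a quick sketch; large between-cluster differences indicate good separation):
aggregate(. ~ cluster, data = BostonScaled2, FUN = mean)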
BostonScaled3<- scale(Boston)
#k-means on the freshly scaled data (not on BostonScaled2, which already carries a cluster column)
km<-kmeans(BostonScaled3, centers = 5)
BostonScaled3<- as.data.frame(BostonScaled3)
BostonScaled3$cluster<- km$cluster
#Then, let's perform LDA with the clusters as target classes
#NB: the double assignment also overwrites ldadata, the crime LDA fitted earlier
LDA<- ldadata<- lda(cluster ~ ., data = BostonScaled3)
#lda.arrows() was already defined above, so we can reuse it here
#Let's define the classes variable
classes <- as.numeric(BostonScaled3$cluster)
#Let's create the biplot with arrows; classes defines both the colors and the plotting symbols
plot(LDA, dimen = 2, col = classes, pch = classes)
lda.arrows(LDA, myscale = 3)
So, let's interpret the results: nox, indus and rad are great separators of the clusters. age, crim and zn separate clusters 1 and 2 from clusters 3, 4 and 5. I would say that age is the strongest linear separator, because it has the longest vector.
First, let's run the code:
model_predictors <- select(trainData, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
#The scaling matrix has 14 rows because ldadata was overwritten above by the cluster LDA,
#whose predictors also included crim; crim is not among the trainData predictors, so we drop row 1
ldadata$scaling<- ldadata$scaling[2:14,]
dim(ldadata$scaling)
## [1] 13 4
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% ldadata$scaling
matrix_product <- as.data.frame(matrix_product)
#Let's color the points by crime class (note: par() settings do not affect plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = trainData$crime)
km<- kmeans(trainData, centers = 5)
#Then, let's color the points by cluster
trainData$clusters<- km$cluster
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = trainData$clusters)
Alright, let's do some visual interpretation. Low crime corresponds mostly to cluster 1 and high crime to cluster 4; otherwise, the points fall into different clusters than their crime rates. A cross-tabulation sketch follows.
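The visual impression can be checked with a cross-tabulation of crime classes against clusters (a sketch; recall that crime was recoded as 1 = low, 2 = low_med, 3 = high_med, 4 = high):
table(crime = trainData$crime, cluster = trainData$clusters)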